The geocoding tool automatically retrieves point coordinates, administrative unit keys, and grid cell IDs. Spatial joins based on coordinates for other units:
Constituencies
Administrative units across time (e.g., harmonized territorial status)
Sources: OpenStreetMap / GEOFABRIK (2018), City of Cologne (2014), Leibniz Institute of Ecological Urban and Regional Development (2018), Statistical Offices of the Federation and the Länder (2016), and German Environmental Agency / EIONET Central Data Repository (2016) / Jünger, 2019
Data Linking
Linking via common identifiers is most commonly used but comes with challenges (e.g., territorial status and land reforms? Comparable units? Heterogeneity within units?).
Spatial linking methods (examples) I
1:1
(sf::st_join())
Distances
(sf::st_distance())
Sources: German Environmental Agency / EIONET Central Data Repository (2016) and OpenStreetMap / GEOFABRIK (2018) / Jünger, 2019
Spatial linking methods (examples) II
Filter methods
(sf::st_filter() or terra::vect(..., filter = ...))
Buffer zones
(sf::st_buffer())
Sources: Leibniz Institute of Ecological Urban and Regional Development (2018) and Statistical Offices of the Federation and the Länder (2016) / Jünger, 2019
Cheatsheet: Spatial Operations
An overview of spatial operations using the sf package can be accessed here.
Data Aggregation
If you want to aggregate attributes and geometries of a shapefile, you can rely on st_combine(x) , st_union(x,y) and st_intersection(x, y) to combine shapefiles, resolve borders and return the intersection of two shapefiles.
For raster data, you can aggregate with the function terra::aggregate()(if you have matching raster files) in combination with terra::resample() (if your raster files don’t match).
Say we’re interested in the impact of neighborhood characteristics (e.g., mobility infrastructure) on individual-level attitudes towards energy transition.
We plan to conduct a survey which is representative of the population of Germany.
https://imgflip.com/i/9ptcuu
Synthetic georeferenced survey data
We have added a synthetic dataset of 2,000 geocoordinates in the ./data/ folder (aggregated to 1 sq km centroids). Initially, it was based on a sample of the georeferenced GESIS Panel.
synthetic_survey_coordinates <-readRDS("./data/synthetic_survey_coordinates.rds")tmaptools::read_osm( synthetic_survey_coordinates, type ="esri-topo") |> terra::rast() |>tm_shape() +tm_rgb() +tm_shape(synthetic_survey_coordinates) +tm_dots(size =2, col ="red")
Correspondence table
As in any survey that deals with addresses, we need a correspondence table of the distinct identifiers.
We ‘ask’ respondents for some standard sociodemographics. But we also include an item from the GLES Panel on energy transformation: “From 2030, no more new cars with petrol or diesel engines are to be registered in Germany. How much do you agree?” (entrans). Since we cannot share the actual data, we created fake data using the faux package.
Better access to charging infrastructure means higher support for energy transformation.
Rural-urban divide
Higher population density means higher support for energy transformation.
District-level data
We already have most of our information on the district level from yesterday.
district_attributes <-# load district shapefile sf::read_sf("./data/VG250_KRS.shp") |># add attribute table dplyr::left_join( readr::read_delim("./data/attributes_districts.csv", delim =";"), by ="AGS" )
District operationalization
Access to charging infrastructure
Charging stations per 1000 inhabitants in a district
Rural-urban divide
Population Density in a district
Access to charging infrastructure
Luckily, we did something similar yesterday!
charger_data <-# Load charging station points readr::read_delim("./data/charging_points_ger.csv", delim =";") |># Filter out rows with missing longitude or latitude dplyr::filter(!is.na(longitude) &!is.na(latitude)) |># Convert data frame to sf object sf::st_as_sf(coords =c("longitude", "latitude"), crs =4326) |># Reproject the spatial data to the desired CRS (Coordinate Reference System) sf::st_transform(crs = sf::st_crs(district_attributes))aggregated_charger_data <- charger_data |># spatial join district ids sf::st_join(district_attributes |> dplyr::select(AGS)) |># Group by district ID dplyr::group_by(AGS) |># Summarize the number of chargers in each district dplyr::summarise(charger_count = dplyr::n()) |># Drop geometry column sf::st_drop_geometry()district_attributes <-# Left join with sampling area attributes dplyr::left_join( district_attributes, aggregated_charger_data, by ="AGS" ) |># Calculate charger density per 1000 population dplyr::mutate(charger_dens = (charger_count *1000) / population)
Rural-urban divide
Our attribute table contains the number of inhabitants per district but not the population density. Therefore, we need to calculate the area of the district.
# calculate area of districts# areas will always be calculated# in units according to the CRS sf::st_area(district_attributes) |>head(4)
# calculating population densitydistrict_attributes <- district_attributes |># calculate area of districts (areas will always# be calculated in units according to the CRS ) dplyr::mutate(area = sf::st_area(district_attributes) ) |># change unit to square kilometers dplyr::mutate(area_km2 = units::set_units(area, km^2) ) |># recode variable as numeric dplyr::mutate(area_km2 =as.numeric(area_km2) ) |># calculate population density dplyr::mutate(pop_dens = population/ area_km2 )tm_shape(district_attributes) +tm_fill("pop_dens", fill.scale =tm_scale(breaks =c(0,100,200,400,800,1600,3200, Inf) ) ) +tm_layout(legend.outside =TRUE)
Respondents in districts
We have population density on the district level. Since our analysis focuses on the individual level, we can spatially join the information to our fake respondents’ coordinates.
district_linked_df <- district_attributes |> sf::st_transform(sf::st_crs(synthetic_survey_coordinates)) |># keeping just the variables we want dplyr::select(charger_dens, publictransport_meandist, pop_dens) |># since we want to join district to# respondent defining coordinates first sf::st_join(x = synthetic_survey_coordinates,# district data secondy = _,# some points may lie on the border# choosing intersects join = sf::st_intersects ) |># drop our coordinates for data protection sf::st_drop_geometry()
We have our nice fake coordinates and know we also have variations in some districts (e.g., Cologne) concerning e-car mobility. Let’s try to operationalize the variables on a smaller level of aggregation.
Access to charging infrastructure > Charging stations in a 5000m buffer
Rural-urban divide > Population in a 5000m buffer
Charging stations in 5000m buffer
The procedure for calculating the number of chargers in a 5km buffer is very similar to calculating the chargers in a district.
# Create 5000m buffers around the fake coordinatesbuffers <- synthetic_survey_coordinates |> sf::st_buffer(dist =5000)# Perform intersection between buffers and points_sfinter <- sf::st_intersects(buffers, charger_data |> sf::st_transform(3035))# Count points within each buffercoordinate_linked_df <- synthetic_survey_coordinates |> dplyr::mutate(num_charger =lengths(inter))
Distance calculation II
sf::st_distance() will calculate between all respondents and all train stations resulting in a huge matrix. We can make our lives easier by first identifying the nearest station and then calculating the distance.
# Find the nearest charging station nearest_charger <- sf::st_nearest_feature( synthetic_survey_coordinates, charger_data |> sf::st_transform(3035) )# Calculate the distance between each point in# fake_coordinates & its nearest charging stationcoordinate_linked_df <- coordinate_linked_df |> dplyr::mutate(charger_distances = sf::st_distance( synthetic_survey_coordinates, charger_data[nearest_charger,] |> sf::st_transform(3035), by_element =TRUE ) |>as.vector() )
Distance calculation II
# add a column for the distancescoordinate_linked_df <- coordinate_linked_df |> dplyr::mutate(# Calculate distances in kilometers charg_dist_km =as.numeric(charger_distances) /1000) summary(coordinate_linked_df$charg_dist_km)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01544 0.36216 0.72309 1.39690 1.80339 13.65649
Population buffers
…and we’re not yet done: we still need the population in the neighborhood. Let’s calculate buffers of 5000 meters and add the population mean values to our dataset.
# download data & extract informationinhabitants <- z11::z11_get_1km_attribute(Einwohner)# spatially link "on the fly"population_buffers <- terra::extract( inhabitants, synthetic_survey_coordinates |> sf::st_buffer(5000), fun = mean,na.rm =TRUE,ID =FALSE,raw =TRUE ) |>unlist()# link with data coordinate_linked_df <- coordinate_linked_df |> dplyr::mutate(population_buffer = population_buffers)
Join with Survey
We hope you’re not tired of joining data tables. Since we care a bit more about data protection, we have yet another joining task: to join the information we received using our (protected) fake coordinates to the actual survey data via the correspondence table.
# last joins for nowfake_survey_data_spatial <-# first join the id dplyr::left_join(correspondence_table, district_linked_df, by ="id") |> dplyr::left_join(coordinate_linked_df, by ="id") |># join the survey data dplyr::left_join(fake_survey_data, by ="id") |> dplyr::select(-id)